feat(sglang): add ephemeral KV session routing#7665
Conversation
595ae55 to
9f89eb6
Compare
9f89eb6 to
f0f178d
Compare
WalkthroughThis pull request transitions cache control infrastructure from TTL-based prefix pinning to session-based lifecycle management. It replaces the Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Follow-up on this branch:
The current PR body / CodeRabbit summary is stale on the cache-control point after this update. |
|
Follow-up correction:
Validation rerun after this change:
|
Signed-off-by: Ishan Dhanani <ishandhanani@gmail.com>
- Lazily register workers in slot tracker when the selector picks a worker that the monitor task hasn't propagated yet. Eliminates "Worker not found" and "Failed to mark prefill completed" warnings. - Use no_fault_detection for the session_control event plane client so open_session RPCs don't fail due to a race between discovery propagation and the fault detection avail list. - Add missing router_enable_agent_controller field to pyo3 binding.
18ad724 to
429a47f
Compare
- _normalize_prompt_token_ids: accept any non-string iterable for input_ids, not just list, preventing dict-keys fallback bug - FakeBatchEncoding: move input_ids to instance attribute (Ruff RUF012) - init_llm: track session_control_endpoint in shutdown_endpoints - publisher: remove duplicate urlparse import - agents.md: fix session example to show proper append-style turns with stream consumption - agg_router.sh: derive YaRN rope_scaling.factor from CONTEXT_LENGTH
PeaBrane
left a comment
There was a problem hiding this comment.
One design question on feature gating: session_control is already a per-request opt-in, and requests that do not carry it naturally want the existing KV-routing behavior. Given that, it seems like a cleaner model would be to keep normal KV routing as the unconditional default and make the session-control machinery request-driven/lazy instead of startup-gated: only initialize and use sticky-session state plus the AgentController when a request actually carries nvext.session_control, and fail only those requests if the worker side does not expose session_control or streaming sessions. That would preserve today's behavior for ordinary traffic, remove one frontend config knob, and make the feature boundary line up more closely with the API surface.
Mirror finish() in the Drop impl: free the scheduler slot before firing the deferred session close, so the worker's KV is not released while generation teardown is still in progress. Also guard the close with try_current() to prevent panics outside a tokio runtime. Remove stale pub mod declarations for subscriber/worker_query (moved to kv_router/indexer/ in #7973).
Check for session_control presence before entering the sticky path, so ordinary KV-routed requests avoid the resolve call entirely.
# Conflicts: # components/src/dynamo/sglang/request_handlers/handler_base.py
…tartup-gated Remove --enable-agent-controller flag. AgentController and StickySessionRouter are now always created in KvPushRouter but activate lazily: the event-plane client is only initialized on the first request carrying session_control, and sticky resolution short-circuits for non-session requests. Addresses review feedback from #7665.
- AgentController uses OnceCell<Option<EventPlaneClient>> to cache unavailability. First session_control request with no endpoint gets a 5s timeout, logs a single WARN, and all subsequent requests skip session lifecycle silently. Requests proceed without isolation. - Handler-side guard: _session_kwargs() checks enable_streaming_session before injecting session_params, preventing SGLang "session does not exist" errors when sessions are off. - Downgrade AgentController/StickySessionRouter init logs to debug. - Update agg_agent.sh with arg parsing (--model-path, --tp, EXTRA_ARGS). - Add session control section to router-guide.md. - Update agents.md: remove stale upstream PR note, clarify request-driven activation, add SGLang-only note, fix OpenCode repo URL. - Add streaming session smoke test.
…l-kv-sessions # Conflicts: # docs/components/frontend/nvext.md # docs/components/router/router-guide.md
- Add realistic agentic prompts (GPU inference engine design workload) - Add main agent + subagent interleaving (matches real orchestrator pattern) - Add --mode kv-pressure for metrics-based KV reclamation validation - Handle reasoning models (reasoning_content fallback) - Add priority support for future ablations
Summary
--enable-agent-controllerflag needed. AgentController and StickySessionRouter activate lazily when requests carrynvext.session_control--enable-streaming-session, the router warns once and ignores session_control. Requests proceed without isolation.Design Changes (from review feedback)
Addressed review feedback:
--enable-agent-controllerflag -- session control is per-request opt-in, not startup-gatedOnceCell<Option<EventPlaneClient>>-- lazy init on first session request, caches unavailability if no endpoint existssession_control_session_kwargs()checksenable_streaming_sessionbefore injectingsession_paramsinto SGLang calls, preventing errors when sessions are offKV Pressure Benchmark
Controlled benchmark with interleaved main agent + subagent traffic. GLM-4.7-Flash TP=2 on 2x L40S. 60 requests: 3 main agent turns + 5 subagent sessions (9-15 turns each) replayed from OpenCode traces.
New metric
sglang:kv_physical_usage=1 - (available_size / total_pool)captures physical GPU memory occupied including evictable radix nodes.Streaming Sessions vs No Sessions
Priority Eviction Ablation
Same workload without streaming sessions but with
--radix-eviction-policy priority. Main agent requests atpriority=50, subagent requests atpriority=1.Result: priority eviction does not solve the problem.
Streaming sessions are fundamentally different -- they free KV proactively on session close, keeping headroom so the main agent's prefix cache survives across subagent cycles.
Validation
cargo check -p dynamo-llm+cargo check -p dynamo-py3(bindings)cargo test -p dynamo-kv-router(269 tests pass)components/src/dynamo/sglang/tests/test_streaming_session_smoke.pypython -m pytest -q components/src/dynamo/frontend/tests/test_sglang_processor_unit.py